Abstract: This paper introduces a spiking hierarchical model
for object recognition which utilizes the precise timing information
inherently present in the output of biologically inspired asynchronous
Address Event Representation (AER) vision sensors.
The asynchronous nature of these systems frees computation and 
communication from the rigid predetermined timing enforced 
by system clocks in conventional systems. Freedom from rigid
timing constraints opens the possibility of using true timing to 
our advantage in computation. We show not only how timing can 
be used in object recognition, but also how it can in fact simplify 
computation. Specifically, we rely on a simple temporal winner-take-all
rather than more computationally intensive synchronous
operations typically used in biologically inspired neural networks 
for object recognition. This approach to visual computation 
represents a major paradigm shift from conventional clocked 
systems and can find application in other sensory modalities and 
computational tasks. We showcase the effectiveness of the approach
by achieving the highest reported accuracy to date (97.5% ±3.5%)
for a previously published four-class card pip recognition task
and an accuracy of 84.9% ±1.9% for a new, more difficult 36-class
character recognition task.


I. Introduction 

This paper tackles the problem of object recognition using 
a hierarchical Spiking Neural Network (SNN) structure. We 
present a model developed for object recognition, which we 
have called HFirst. The name arises because the approach 
extensively relies on the first spike received during computation
to implement a non-linear pooling operation, which
is typically required by frame-based Convolutional Neural 
Networks (CNNs). 

We rely on the biological observation that strongly activated 
neurons tend to fire first a, a. In particular, we focus on the 
relative timing of spikes across neurons, namely the order in 
which neurons fire. We will argue that such a scheme allows us 
to derive temporal features that are particularly suited for robust
and rapid object recognition at a very low computational cost. 
Existing work on artificial neural networks tends to assume
a predetermined timing which is completely independent of 
the processing taking place. This prohibits these artificial NNs 
from using time in their computation. However, the timing of 
communication (spikes) in biological networks is known to be 
very important. Much like biological networks, in this paper 
we exploit spike timing to our advantage in computation. More 
specifically we rely on the time at which a spike is received 
to implement a simple non-linear operation which replaces the 
more computationally intensive maximum operation typically 
used in non-spiking neural networks for visual processing. 

Artificial Neural Networks (NNs), of which CNNs are a 
subset, have successfully been used in many applications, 
including signal and image processing El, a, and pattern 
recognition 0, while hardware acceleration of such models 
allows real-time operation on megapixel resolution video 0. 
Although CNN models are argued to be biologically inspired, 
their artificial implementations are typically far removed from 
biological neural networks, most of which consist of spiking 
neurons. 

Spiking Neural Networks (SNNs) have received a lot of 
attention recently as new, more efficient computing technologies
are sought as conventional CMOS technology approaches
its fundamental limits. SNNs have the potential to achieve 
incredibly high power efficiency. This is not a claim that 
we provide our own evidence for, but is rather based on 
observations of power consumption in biology (the human 
brain consumes only 20 W) and recent works which present
SNNs on chip with impressive power efficiency. Examples
include Neurogrid 0 and IBM’s TrueNorth 0 which can
simulate 1 million spiking neurons while consuming under
100 mW. In this paper we address the question of how SNNs
can be used for visual object recognition. 

Modern reconfigurable custom SNN hardware platforms
can implement hundreds of thousands to millions of spiking
neurons in parallel. Examples of these hardware implementation
projects include the Integrate and Fire Array
Transceiver (IFAT) 0, Hierarchical AER-IFAT [10], BrainScaleS
[11], Spiking Neural Network Architecture (SpiNNaker)
[12], Neurogrid 0, Qualcomm’s Zeroth Processor,
and IBM’s TrueNorth 0 (fabricated with Samsung).

In parallel with these hardware platforms, software platforms
for neural computation have emerged, including the
Neural Engineering Framework (NEF) O, Brian m, and
PyNN HSl, many of which can be used to configure the 
hardware platforms previously mentioned. Continued interest 
and funding from the European Union’s Human Brain Project 
M and the USA’s Brain Research through Advancing In¬ 
novative Neurotechnologies (BRAIN) project El will drive 
development of such systems for years to come. 
As neural simulation hardware matures, so must the algorithms
and architectures which can take advantage of this
hardware. However, it does not necessarily make sense to
directly convert existing computer vision models and algorithms
(which process traditional frame-based data) to SNN
implementation. A central concept within SNNs is that spike 
timing encodes information, but frames do not contain precise 
timing information. The timing of the arrival of frames is 
purely a function of the front end sensor and is completely 
independent of the scene or stimuli present. In order for a 
SNN to exploit precise timing, it must operate on data which 
contains precise timing information and not spike timings 
artificially generated from frame-based outputs. To obtain 
visual data with precise timing, we turn our attention to 
asynchronous AER vision sensors, sometimes referred to as 
“silicon retinae’ ’m, mi. These sensors more closely match 
the operation of biological retina and do not utilize frames. 

Asynchronous AER vision sensors have seen much improvement
since their introduction in the early 1990s by
Mahowald [20]. Modern change detection AER sensors reliably
provide information on changes of illumination at the
focal plane over a wide dynamic range and under a variety 
of lighting conditions. The pixels within such sensors each 
contain a circuit which continuously performs local analog 
computation to detect the occurrence and time of changes in 
intensity for that particular pixel. This computation at the focal 
plane is a form of redundancy suppression, ensuring that pixels 
only output data when new information is present (barring 
some background noise). Furthermore, the time of arrival of 
data from the sensor accurately represents when the intensity 
change occurred. Under test conditions sub-microsecond accuracy
is achieved, versus accuracy on the order of milliseconds
for fast frame-based cameras. This temporal accuracy provides 
precise spike timing information which can be exploited by a 
SNN. 

Much like SNNs are a more accurate approximation of 
biological processing hardware, AER vision sensors are a more 
accurate approximation of the biological retina. The single bit 
of data provided by a pixel can be likened to a neural spike, 
and much like a biological retina, the AER sensor performs 
computation at the focal plane. Notable examples of spiking 
AER vision sensors include the earliest examples of spiking 
silicon retinae by Culurciello et al. in and Zaghloul et al. 
m, as well as the more recent Dynamic Vision Sensor (DVS) 
from Delbruck [21], the sensitive DVS from Linares-Barranco
[22], and the Asynchronous Time-based Image Sensor (ATIS)
from Posch [23]. Operation of these sensors will be discussed
in Section II. For a review of asynchronous event-based vision
sensors see Delbruck et al. [24].

With the emergence of these asynchronous vision sensors, 
many researchers have taken an interest in processing their 
data in a manner which takes advantage of the asynchronous, 
high temporal resolution, and sparse representation of the 
scene they provide. Models of early visual area V1, including
saliency, attention, foveation, and recognition [25]-[27] have
been implemented by combining the reconfigurable IFAT
system [28] with the Octopus silicon retina mi. More recent
focuses in the field include stereo vision [29]-[31], motion
estimation [32], [33], tracking [34], and more object recognition
works [35]-[37]. Further information on neuromorphic
sensory systems can be found in Liu and Delbruck [38].

In this paper we focus on the task of object recognition. 
The most similar recent works include a VLSI implementation
of the HMAX model [39], [40] for recognition which
uses spiking neurons throughout [41]. The VLSI spiking
HMAX implementation computes all the functions required by
HMAX, but operates on 24x24 pixel images, limited by the
number of available neurons, and does not run in real-time.
Adaptations of frame-based CNN techniques for training SNNs and
implementing them in FPGA have also been recently presented 
E2, including a recent PAMI paper ll^ which presented a 
high speed card pip recognition task which we also tackle in 
this paper as a comparison to existing works. 

In this paper we present our SNN architecture dubbed 
“HFirst”, which takes advantage of timing information provided
by AER sensors. A key aspect is that our architecture
uses spike timing to encode the strength of neuron activation, 
with stronger activated neurons spiking earlier. This enables 
us to implement a MAX operation using a simple temporal 
Winner-Take-All (WTA) rather than performing a synchronous 
MAX operation as is typically done in frame-based algorithms.
Unlike the frame-based MAX operation, which outputs
a number representing the strength of the strongest input, the 
temporal WTA can only output a spike, but by responding 
with low latency to its inputs, the temporal WTA preserves 
the time encoding of signal strength. It should be noted that 
other methods of implementing a MAX operation in spikes 
have been presented previously ED. 

Masquelier et al. B3l also use a temporal WTA, but their 
approach focuses on static images and spike generation from 
these images is artificially simulated, whereas we use AER 
vision sensors ED-ES to directly capture data from dynamic 
scenes for recognition. Additionally, Masquelier et al. require 
their network to be reset before a second object can be 
recognised, whereas HFirst operates on streaming “video” 
and can recognise multiple objects in sequence, or even 
simultaneously. 

The HFirst model described here can be used with many 
of the available AER change detection sensors, and could be 
implemented on one of many neural processing platforms. 
For this particular work we analysed HFirst in simulation 
using a combination of C and Matlab on a desktop PC. Once 
simulated, the SNN was implemented in real-time on a Xilinx 
Spartan 6 XC6SLX150-2 FPGA. 

The rest of this paper is organized as follows. In the next 
section we briefly describe the event-based vision sensors, 
then we describe the neuron model using spike timing for
computation in Section III. The HFirst architecture is described
in Section IV, followed by brief analysis of the required
computation and real-time implementation. Testing and results 
are then presented to showcase the model accuracy before 
wrapping up with discussions and conclusions. 
Fig. 1. Event-based vision sensor acquisition principle. (a) Typical signal
showing the log of luminance of a pixel located at $[u,v]^T$. Dotted lines show
how the thresholds for detecting increases and decreases in intensity change
as outputs are generated. (b) Asynchronous temporal contrast events generated
by this pixel in response to the light variation shown in (a).

II. Asynchronous Change Detection Vision 
Sensors 

Neuromorphic, event-based vision sensors are a novel type
of vision sensor driven by changes within the visual scene,
much like the human retina, and differ from conventional
image sensors which use artificial timing to control information
acquisition. The sensors used in this paper ED-Ca
consist of autonomous pixels, each asynchronously generating 
spike events that encode relative changes in illumination. 
These sensors capture visual information at a much higher 
temporal resolution than conventional vision sensors, achiev¬ 
ing accuracy down to sub-microsecond levels under optimal 
conditions. Moreover, since the pixels only detect temporal 
changes, temporally redundant information is not captured 
or communicated, resulting in a sparse representation of the 
scene. Captured events are transmitted asynchronously by the
sensor in the form of continuous-time digital words containing
the address of the activated pixel using the AER protocol [20].

To better understand the operation of these sensors we
will briefly provide a formulation to approximate the sensor
response to visual stimuli. Let us define $I(u,v,t)$ as the
intensity of a pixel located at $[u,v]^T$, where $u$ and $v$ are
spatial co-ordinates in units of pixels. Each pixel of the sensor
asynchronously generates events at the precise time when the
change in the log of the pixel illumination, $\Delta\log(I(u,v,t))$,
since the last event is larger than a certain threshold $\Delta I$, as
shown in Fig. 1(a) and (b). The logarithmic relation means the
pixels respond to percentage changes in illumination rather
than the absolute magnitude of the change. This allows pixels
to operate over a very wide dynamic range (>120 dB).
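
The thresholding rule above can be illustrated with a minimal sketch. It assumes the pixel is observed through discrete intensity samples (a real sensor operates in continuous time), and the names `generate_events`, `intensities`, `times`, and `threshold` are hypothetical choices made for this example rather than part of any sensor interface.

```python
import math


def generate_events(intensities, times, threshold):
    """Emit (time, polarity) events whenever the log intensity has moved by
    more than `threshold` since the level at which the last event was emitted."""
    events = []
    ref = math.log(intensities[0])  # log intensity at the last event
    for t, intensity in zip(times[1:], intensities[1:]):
        diff = math.log(intensity) - ref
        # A large change between two samples may cross the threshold repeatedly.
        while abs(diff) > threshold:
            polarity = 1 if diff > 0 else -1
            ref += polarity * threshold
            events.append((t, polarity))
            diff = math.log(intensity) - ref
    return events


# Doubling the intensity produces the same events at any absolute level,
# which is the percentage-change behaviour described in the text.
dim = generate_events([1.0, 2.0, 4.0], [0, 1, 2], threshold=0.5)
bright = generate_events([100.0, 200.0, 400.0], [0, 1, 2], threshold=0.5)
```

Because only the change in log intensity matters, `dim` and `bright` contain identical events, and a constant input produces none.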

Under constant scene illumination the intensity changes
seen by the sensor are due to the combination of a spatial
image gradient and a component of image motion along that
gradient, as described by the equation below, which is a first
order approximation of the image constancy constraint.

$$\frac{\partial I(u,v,t)}{\partial t} = -\frac{\partial I(u,v,t)}{\partial u}\frac{du}{dt} - \frac{\partial I(u,v,t)}{\partial v}\frac{dv}{dt} \tag{1}$$

where $I(u,v,t)$ is intensity on the image plane, and $u$ and $v$
are horizontal and vertical coordinates measured in units of
pixels. The sensor will therefore generate the most events at



Fig. 2. Operation of the Integrate-and-Fire neurons (IF neurons) used, showing
how synaptic weights and time affect the neuron membrane potential, as well 
as the operation of lateral reset connections (lateral meaning connecting to 
other neurons in the same layer) and refractory period. 

locations where a large image gradient is present, as will be
discussed further in Section III-B.

III. Computing with Neurons 
A. Neuron model 

The neuron model we use is a simple Integrate-and-Fire
neuron (IF neuron) m with linear decay and a refractory
period, as shown in Fig. 2. We foresee that the model would
translate to hardware implementations which model many
neurons in parallel, but the neurons in such hardware implementations
may have very limited precision. To account for
the possible limited precision in implementation, in software
we simulate subthreshold membrane potential decay with 1 ms
time precision and restrict all neuron parameters ($V_{thresh}$, $I_l/C_m$,
and $t_{refr}$ in Table I) to be unsigned 8 bit integers with 1
Least Significant Bit (LSB) corresponding to 1 unit shown in
Table I. During simulation, membrane potential is stored as
an integer value in units of millivolts.

The simple behaviour of IF neurons ensures that an output
spike can only be elicited by an excitatory input spike, and not 
by subthreshold membrane potential dynamics in the absence 
of excitatory input. When an input to a neuron arrives, the 
neuron’s new state (membrane potential) can be entirely deter¬ 
mined by the time since it was last updated, and its state after 
the previous update. We therefore need only update a neuron 
when it receives an input spike (rather than at a constant time 
interval). Neurons are organized into a hierarchical structure 
consisting of layers. When an input spike arrives from a lower 
layer, the update procedure for the neuron is: 

    if $t_i - t_{lastspike} < t_{refr}$ then
        $V_{m_i} \leftarrow V_{m_{i-1}}$
    else
        $V_{m_i} \leftarrow \begin{cases} \max\{V_{m_{i-1}} - \frac{I_l}{C_m}(t_i - t_{i-1}),\ 0\} & \text{if } V_{m_{i-1}} > 0 \\ \min\{V_{m_{i-1}} + \frac{I_l}{C_m}(t_i - t_{i-1}),\ 0\} & \text{if } V_{m_{i-1}} < 0 \end{cases}$
        $V_{m_i} \leftarrow V_{m_i} + \omega_i$
    end if
    if $V_{m_i} > V_{thresh}$ then
        $V_{m_i} \leftarrow 0$
        $t_{lastspike} \leftarrow t_i$
        Do(Generate Output Spike)
    end if

where $t_i$ is the time at which the input spike arrives,
$t_{i-1}$ is the time of the previous update,
$t_{lastspike}$ is the time at which the current neuron last generated
an output spike, $t_{refr}$ is the refractory period of the neuron,
$V_{m_i}$ is the membrane voltage after the input spike, $I_l$ is the
leakage current, $C_m$ is the membrane capacitance, $\omega_i$ is the
input weight of the input spike, and $V_{thresh}$ is the threshold
voltage for the current neuron.

Output spikes from a neuron feed similarly into the layer 
above, but can also affect neurons within the same layer 
through lateral connections. When an input is received from a 
lateral connection, it forces the receiving neuron to reset and 
enter a refractory period. In practice we implement this by
tricking the reset neuron into thinking it has recently spiked
by using the update:

    $t_{lastspike} \leftarrow t$

where $t$ is the current time.
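
As an illustration of the event-driven update and the lateral reset, the following sketch implements one neuron with integer state in Python. It is not the C/Matlab simulation or the FPGA logic; the class name `IFNeuron`, its field names, and the choice to advance the stored update time even for inputs ignored during refraction are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class IFNeuron:
    """Event-driven integrate-and-fire neuron with linear decay (integer units)."""
    v_thresh: int               # spiking threshold (mV)
    leak: int                   # I_l / C_m, linear decay rate (mV/ms)
    t_refr: int                 # refractory period (ms)
    v_m: int = 0                # membrane potential (mV)
    t_update: int = 0           # time of the previous update (ms)
    t_lastspike: int = -10**9   # assumption: start well outside refraction

    def receive(self, t: int, weight: int) -> bool:
        """Apply an input spike at time t; return True if the neuron fires."""
        if t - self.t_lastspike >= self.t_refr:
            # Decay linearly toward zero for the elapsed time, then integrate.
            decay = self.leak * (t - self.t_update)
            if self.v_m > 0:
                self.v_m = max(self.v_m - decay, 0)
            elif self.v_m < 0:
                self.v_m = min(self.v_m + decay, 0)
            self.v_m += weight
        # Assumption: the update time advances even when the input is ignored.
        self.t_update = t
        if self.v_m > self.v_thresh:
            self.v_m = 0
            self.t_lastspike = t
            return True
        return False

    def lateral_reset(self, t: int) -> None:
        """Lateral reset: pretend the neuron spiked at time t."""
        self.t_lastspike = t


s1 = IFNeuron(v_thresh=200, leak=50, t_refr=5)  # S1 values from Table I
fired = [s1.receive(t, 120) for t in (0, 1, 2, 3, 7)]
```

With these S1 parameters, three closely spaced inputs are needed to cross threshold, the input arriving during refraction is ignored, and the neuron resumes integrating once the refractory period has elapsed.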

B. Using Spike Timing to Find the Max 

Jarrett et al. mu showed in a comparison of object recognition
architectures that the top performing algorithms are those
with a hierarchical structure incorporating a non-linearity,
although some more recent works show similar performance
with a single layer of neurons, but at the expense of increased
computational complexity and training difficulty [46]. In the
case of the popular HMAX 1^ model, this non-linearity is
a maximum operation in the pooling stages (C1 and C2).
Finding this maximum requires comparing the responses of
all units within the region to be pooled. The maximum value
is then passed through to the next layer, irrespective of how
large or small it is.

In the HFirst architecture we observe which neuron re¬ 
sponds first, and judge that neuron to have the maximal 
response to the stimulus. This is based on two main observa¬ 
tions. Firstly, that sharper edges (larger spatial gradients) result 
in larger temporal contrast O, therefore generating events 
sooner than less sharp edges. Secondly, the higher the spatial 
correlation between a neuron’s input weights and the spatial 
pattern of incoming spikes, the stronger it will be activated (see
Fig. 3). The strongest activated neuron will cross its spiking
threshold before other neurons, thereby providing an indication 
that its response is strongest. Using this mechanism there is 
no need to compare neuron responses to each other, rather we 
simply observe which neuron generated an output first. The 
first spike from a pooling region can then be used to reset other 
orientations through lateral reset connections, thereby ensuring 
that non-maximal responses are not propagated through to 
subsequent layers. 

Fig. 3 shows how neurons tuned to different orientations
will respond when an edge is presented. The neuron tuned 



Fig. 3. Competition between neurons tuned to different orientations when 
presented with a visual edge oriented at 90 degrees. The neuron tuned to 90 
degrees is strongest stimulated causing it to cross spiking threshold first and 
reset all other orientations. 


to the orientation of the edge (90 degrees, solid line) is 
strongest activated and crosses the spiking threshold before 
other neurons (dotted lines). Neurons tuned to orientations 
similar to the stimulus (75 and 105 degrees) are next strongest 
activated, but are reset by the neuron sensitive to 90 degrees 
(since it spiked first). Neurons tuned to orientations below 45 
degrees and above 135 degrees are not shown to reduce figure 
clutter. 

The “time to first spike” approach simplifies computation 
of the max. It indicates which neuron has the strongest 
response, and through the time at which the spike is elicited it 
conveys how strong the response is. However, if no neuron was 
activated strongly enough to generate an output spike, no first- 
spike is detected and no output spikes are generated. This is an 
important property ensuring that no computation is performed 
when there is insufficient activity in the scene. Much like the 
front end sensor, which represents lack of stimulus (temporal 
contrast) through a lack of data, HFirst represents the lack of 
a strong enough neuron activation through a lack of output 
spikes. 
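
The equivalence between "first to spike" and "maximum response" can be sketched under a deliberately simplified assumption: each neuron receives a constant drive and has no leak, so its time to threshold is the threshold divided by the drive. The function names and the constant-drive model are illustrative assumptions, not the neuron dynamics used in HFirst.

```python
def time_to_threshold(drive, v_thresh):
    """Time for a leak-free neuron under constant drive to reach threshold,
    or None if it is never driven to threshold."""
    if drive <= 0:
        return None
    return v_thresh / drive


def temporal_wta(spike_times):
    """Return (winner_index, spike_time) for the earliest spike,
    or None when no neuron in the pooling region spiked."""
    candidates = [(t, i) for i, t in enumerate(spike_times) if t is not None]
    if not candidates:
        return None
    t, i = min(candidates)
    return i, t


drives = [0.5, 2.0, 1.0]  # hypothetical activation strengths (mV/ms)
times = [time_to_threshold(d, 200) for d in drives]
winner = temporal_wta(times)
```

Here the winner is the neuron with the largest drive, found without comparing activations directly, and a region in which no neuron reaches threshold yields no output at all.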


IV. Asynchronous HFirst Architecture 

HFirst is structured in a similar manner to hierarchical 
neural models ED, is, which consist of four layers, named 
Simple 1 (S1), Complex 1 (C1), Simple 2 (S2), and Complex
2 (C2). In these frame-based architectures, cells in simple layers
densely cover the scene and respond linearly to their inputs, 
while cells in complex layers have a non-linear response and 
only sparsely cover the scene. The layers and manner in which 
computation is performed in HFirst differ considerably from
previous implementations of similar computational models of
object recognition in cortex ll39ll . HtI . BS). The Simple 
layers in HFirst are in fact non-linear due to the use of a 
spike threshold and binary spike output. In the remainder 
of this section the form and function of each HFirst layer 
is described. The same neuron model is used for all layers, 
but with different parameters and connectivity. The network 
TABLE I

Neuron Parameters

Layer   V_thresh   I_l/C_m   t_refr   Kernel Size   Layer Size
S1      200        50        5        7x7x1         128x128x12
C1      1          0         5        4x4x1         32x32x12
S2      100-200    10        10       8x8x12        32x32xNy
C2      1          0         10       32x32x1       1x1xNy
unit    mV         mV/ms     ms       synapses      neurons


synaptic weights. $\theta$ varies from 0 to 165 degrees in increments
of 15 degrees. 
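
A sketch of how the 12 kernels could be generated is given below. It assumes the Gabor form of Serre et al. with an aspect ratio parameter; the value `gamma=0.3` is an assumption taken for illustration and is not stated in this section, as is the choice to centre the 7x7 grid on the pixel. Only the wavelength 5, the width 2.8, the 7x7 size, and the 12 orientations come from the text.

```python
import math


def gabor_kernel(theta_deg, size=7, lam=5.0, sigma=2.8, gamma=0.3):
    """Even (cosine-phase) Gabor kernel of size x size pixels, as nested lists."""
    theta = math.radians(theta_deg)
    half = size // 2
    kernel = []
    for v in range(-half, half + 1):
        row = []
        for u in range(-half, half + 1):
            # Rotate the coordinates to orient the filter.
            u0 = u * math.cos(theta) + v * math.sin(theta)
            v0 = -u * math.sin(theta) + v * math.cos(theta)
            envelope = math.exp(-(u0 ** 2 + gamma ** 2 * v0 ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * u0 / lam))
        kernel.append(row)
    return kernel


# One kernel per orientation: 0, 15, ..., 165 degrees.
kernels = {theta: gabor_kernel(theta) for theta in range(0, 180, 15)}
```

In a fixed-precision implementation these floating point values would still need to be scaled and quantised to the integer weight range, a step this sketch leaves out.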

S1 neurons are divided into adjacent non-overlapping 4x4
pixel regions, referred to as S1 units. Each S1 unit feeds into
12 C1 neurons, one for each orientation. C1 neurons have
lateral reset connections between orientations to perform the
max operation discussed in Section III-B. C1 neurons use a
very low threshold voltage to ensure that a single input spike
is sufficient to generate an output spike (provided the neuron
is not in its refractory period).

The refractory period in C1 saves computation by reducing
the number of spikes which need to be routed within the
architecture. Limiting the firing rate is also important to ensure
that no single C1 neuron can fire rapidly enough to
single-handedly elicit a spike from an S2 neuron.


Fig. 4. The HFirst model architecture, consisting of four layers (S1, C1, S2,
C2). Only a 32x32 pixel cropped region of real data extracted from the model
is shown to ease visibility while demonstrating recognition of the character
‘R’. Black dots represent data from the model. The character ‘R’ has been
superimposed on top of the S1 and C1 data to aid explanation. The size of the
(cropped) data is shown at the left of each layer (Table I shows the sizes for
the full model). The S1 layer performs orientation extraction at a fine scale,
followed by a pooling operation in C1. Note that due to lateral reset in C1,
some S1 responses are blocked (for example, the last three orientations on the
bottom row). The S2 layer combines responses from different orientations, 
but maintains spatial information. The C2 layer pools across all S2 spatial 
locations, providing only a single output neuron for each character. 


architecture is shown in Fig. 4 and the parameters for each
stage are shown in Table I.


A. Layer 1: Gabor Filters 

The S1 layer densely covers the scene with even Gabor
filters at 12 orientations. All filters are 7x7 pixels, resulting 
in 12 filters at each pixel. These filters are designed to 
pick up sharp edges. Filter kernels are generated with the 
same equation as in Serre et al. 1^ . repeated below for 
convenience. 




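\begin{align}
F_\theta(u,v) &= e^{-\frac{u_0^2 + \gamma^2 v_0^2}{2\sigma^2}} \cos\left(\frac{2\pi}{\lambda} u_0\right) \nonumber \\
u_0 &= u\cos\theta + v\sin\theta \tag{2} \\
v_0 &= -u\sin\theta + v\cos\theta \nonumber
\end{align}

where $u$ and $v$ are horizontal and vertical location in pixels,
$u_0$ and $v_0$ are used to effect a rotation which orients the filter,
and $\gamma$ is the filter aspect ratio.
Parameters of $\lambda = 5$ and $\sigma = 2.8$ were used to generate the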


B. Layer 2: Template Matching 

S2 neurons densely cover C1 neurons, with each receiving
inputs from 8x8 C1 neurons of all orientations. S2 receptive
fields are created during a training phase as described below.

A simple activity tracker [34] is used to track training
objects and compensate for their motion to generate a static
32x32 pixel view of the object. This stabilised view is
processed by S1 and C1, and the number of spikes of each
orientation originating from each C1 neuron is counted. Note
that due to the non-overlapping S1 units, the 32x32 pixel input
region feeds into 8x8 C1 neurons, which is the size of an S2
receptive field in HFirst (see Table I).

The counts generated in this manner constitute the synaptic 
weights (or input kernel) for the S2 neuron sensitive to this 
object. A separate neuron is required for each object to be 
recognized. For each neuron, synaptic weights are normalised
to have an $\ell_2$ norm of 100. Finally, since negative spike counts
are not possible, all zero valued weights are replaced with 
inhibitory values (-1) to reduce noise sensitivity. A copy of 
each trained neuron is then implemented at every location, 
allowing detection of all trained objects at all locations. 
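
The training step for one S2 neuron can be sketched as follows, assuming the C1 spike counts have already been collected into a flat list (8x8 locations by 12 orientations in HFirst). The function name `s2_weights` and its parameters are hypothetical, and the rounding to integer weights needed for a fixed-precision implementation is omitted.

```python
import math


def s2_weights(spike_counts, target_norm=100.0, inhibitory=-1.0):
    """Turn C1 spike counts into S2 synaptic weights: normalise to the target
    l2 norm, then replace zero-count synapses with an inhibitory weight."""
    norm = math.sqrt(sum(c * c for c in spike_counts))
    if norm == 0:
        raise ValueError("no C1 spikes were counted for this object")
    scaled = [c * target_norm / norm for c in spike_counts]
    return [w if c > 0 else inhibitory for w, c in zip(scaled, spike_counts)]


weights = s2_weights([3, 4, 0, 0])  # toy counts for four synapses
```

The normalisation is applied before the zero-valued weights are replaced, following the order in which the text describes the two steps.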

Fig. 5 shows an example of a learnt S2 receptive field
for recognizing the character ‘G’. The figure shows how the
highest input synapse weights are assigned to locations where
the orientation of the character’s edges matches the orientation to
which the underlying C1 neurons are tuned.

S2 neuron spikes reset all other S2 neurons within an 8x8 
region sensitive to other classes of objects, thus implementing 
the max operation discussed in Section III-B. Furthermore, by
only resetting neurons sensitive to other object classes, the 
detected object class is given a “head start” in the race to 
first spike in the nearby region. This can be seen as using the 
A parallelised and pipelined FPGA implementation was
programmed to run in real-time on the Opal Kelly XEM6010-LX150
board, which includes a Xilinx Spartan 6 XC6SLX150-2
FPGA. The model operates on a 128x128 pixel input. The
implementation runs at a clock frequency of 100 MHz and
uses internal block RAM without relying on external RAM. 
The final output of the system consists of S2 output spikes, 
although access is also provided to spikes from intermediate 
layers for characterization. 


Fig. 5. The receptive field of an S2 neuron trained to recognize the character
‘G’. The neuron receives inputs from an 8x8 (x × y) C1 region and from all
12 orientations (orientations are indicated by the oriented black bars). Dark
regions indicate strong excitatory weights and can be seen to fall along edges
of the character wherever edge orientation matches the C1 neuron orientation.
Weaker responses between 135 and 165 degrees (bottom right) are due to the
direction of motion of the character during training (roughly 150 degrees).
Motion perpendicular to an edge’s orientation is required to elicit temporal
contrast, as shown in m and discussed in Section II. After normalization the
weights in this example range from -1 mV to 33 mV, indicated by the bar on
the right.

detection to create a prior expectation of detecting that object 
again nearby. 

An optional C2 layer can be used to pool all responses from 
all S2 locations for classification. The C2 layer is not always 
used because it discards information regarding the location 
of the object, which can be particularly useful when multiple 
objects of interest are simultaneously present in the scene. 

C. Classifier 

A basic classifier outputs the soft probabilities for the object
belonging to each class. The probability $P(i)$ of an object
belonging to class $i$ is calculated as

$$P(i) = \frac{n_i}{\sum_j n_j} \tag{3}$$

where $n_i$ is the number of spikes elicited by S2 neurons
sensitive to the $i^{th}$ class. When $\sum_j n_j = 0$ we assign $P(i) = 0$
for all classes.

If we wish to force the classifier to choose only a single
class, we can assign the output class $y$ as

$$y = \arg\max_i (n_i) \tag{4}$$

We have no neuron to respond to lack of an object in a 
scene. Lack of an object results in lack of positive detections. 
This is a fundamental concept of the computing and sensing 
paradigm we use. Lack of information is not communicated, 
but is rather represented by a lack of communicated data. 
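
A short sketch of this classifier, working from per-class S2 spike counts, is given below. The function name `classify` and the use of `None` to signal "no detection" are choices made for this example.

```python
def classify(spike_counts):
    """Return (probabilities, winning_class) from per-class S2 spike counts.
    With no spikes, every probability is 0 and no class is chosen."""
    total = sum(spike_counts)
    if total == 0:
        return [0.0] * len(spike_counts), None
    probabilities = [n / total for n in spike_counts]
    winner = max(range(len(spike_counts)), key=lambda i: spike_counts[i])
    return probabilities, winner


probs, label = classify([1, 3, 0, 0])   # four classes, as in the card pip task
empty_probs, empty_label = classify([0, 0, 0, 0])
```

Returning no label for an empty scene mirrors the point made above: the absence of an object is represented by the absence of output, not by a dedicated class.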

V. Implementation 

In this section we briefly analyse computational require¬ 
ments. The number of input spikes generated by the front 
end sensor varies with scene activity and dictates the required 
computation since neuron updates are only performed when 
spikes are received. We analyse computation as a function of 
the number of input and output spikes for each layer. A worst 
case scenario is used which assumes that a neuron is updated 
every time it receives a spike (ignoring the refractory period). 


A. S1 and C1: Gabor Filters

Each input spike in S1 routes to all S1 neurons within a 7x7
pixel region. There are 12 S1 neurons at each pixel location
(one per orientation), resulting in 12x7x7 = 588 synapse
activations per input spike.

For FPGA implementation, 84 synapses update in parallel,
requiring 7 clock cycles to update all 588 synapses, allowing
the S1 stage to sustain throughput of 14M events per second.

Each S1 output spike excites a single C1 neuron, and resets
the 11 C1 neurons sensitive to other orientations, resulting in
12 C1 synapse activations per S1 output spike. C1 updates all
12 synapses in parallel and can process 25M input events per
second.

B. S2 and C2: Template Matching 

Each input spike to S2 routes to all S2 neurons within
an 8x8 region. If Ny denotes the number of classes to be
classified, then there will be Ny neurons at each location, and
each input spike will activate Ny x 8 x 8 = 64Ny input synapses.
Each S2 output spike resets all S2 neurons lying within an
8x8 region around where the spike originated. So, for every
S2 output spike Ny x 8 x 8 = 64Ny S2 lateral reset synapses are
activated.

The FPGA implementation of S2 can update Ny neurons in 
parallel, requiring 64 clock cycles to process each input or 
output spike. The number of C2 input synapses activated is 
equal to the number of S2 neuron output spikes. The C2 stage 
is optional and not implemented in FPGA. 
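
The throughput figures can be checked with a few lines of arithmetic. The sketch assumes that throughput is simply the 100 MHz clock divided by the clock cycles spent per event, ignoring any pipeline stalls; the function and variable names are illustrative.

```python
import math

CLOCK_HZ = 100_000_000  # FPGA clock frequency stated in the text


def events_per_second(cycles_per_event):
    """Upper bound on sustained event rate for a fixed per-event cycle cost."""
    return CLOCK_HZ / cycles_per_event


s1_synapses = 12 * 7 * 7                  # 588 synapse activations per input spike
s1_cycles = math.ceil(s1_synapses / 84)   # 84 synapses updated per clock cycle
s1_rate = events_per_second(s1_cycles)    # about 14.3 million events per second
s2_rate = events_per_second(64)           # 64 cycles per S2 input or output spike


def s2_synapses(n_classes):
    """Synapse activations per S2 input spike for n_classes object classes."""
    return n_classes * 8 * 8
```

Under this assumption the S1 stage sustains roughly 14.3M events per second and the S2 stage roughly 1.56M, consistent with the 14M and 1M figures quoted for the implementation. The 25M figure for C1 would correspond to 4 clock cycles per event, which is an inference rather than something stated in the text.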

In HFirst there are no zero valued synapses in S2. Synapses 
which are not activated during training are assigned an inhibitory
synaptic weight of -1 (see Section IV-B). The number
of synapses in S2 could be significantly reduced by instead 
assigning a weight of 0 to these synapses and optimizing them 
out of the model. However, such an optimization would in¬ 
troduce significant additional complexity in pipelining for the 
FPGA implementation. The FPGA implementation benefits far 
more from the simplified pipelining which results from having 
a dense regular connection structure where all synapses are 
implemented. This connection structure is also more general, 
allowing the synaptic weights to be easily reprogrammed. 

The regular connection structure also saves memory by
ensuring that when a neuron is updated, all co-located neurons
will also be updated. Updating all co-located neurons
simultaneously allows us to store only a single time value to
indicate when all neurons at that location were updated, rather
than storing a separate time value to indicate when each
individual neuron was last updated (Section III-A shows how the
time value is used in the neuron update). This memory saving is
important because memory availability is the limiting factor in
scaling the model to higher resolution, as shown in the next
section.

TABLE II
Required Computation and Resources

Stage                        S1      C1      S2      C2
Synapse updates per event    588     12      64Ny    1
Throughput (events/sec)      14M     25M     1M      100M
DSP blocks                   16      0       1       0
Block RAM                    128     2       Ny+1    0


C. Scaling to higher resolution 

When scaling to higher resolutions two main factors need
to be considered: memory requirements, and computational
requirements. Required memory scales linearly with the number
of neurons in the model, which in turn scales linearly with
the number of input pixels. Computational requirements scale
linearly with the input event rate.

With 36 classes (Ny = 36), 167 Block RAMs are used for
HFirst (see Table II), plus an additional 10 for pipeline FIFOs
and USB IO, resulting in a total of 177 of the available 268
Block RAMs being used for 128×128 pixel input resolution.
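
The Block RAM budget can be checked with a few lines of arithmetic. The sketch below uses the per-stage figures from Table II and the 10 extra Block RAMs quoted above. The function names are hypothetical, and the resolution estimate simply applies the linear scaling stated in this section; treating the FIFO and USB overhead as fixed is an assumption.

```python
import math

AVAILABLE_BLOCK_RAMS = 268   # total available on the FPGA, from the text
OVERHEAD_BLOCK_RAMS = 10     # pipeline FIFOs and USB IO, from the text
BASE_PIXELS = 128 * 128      # resolution to which the Table II figures apply


def hfirst_block_rams(num_classes: int) -> int:
    """Block RAMs used by S1, C1, S2 and C2 at 128x128 (Table II)."""
    s1, c1, c2 = 128, 2, 0
    s2 = num_classes + 1
    return s1 + c1 + s2 + c2


def estimated_total_block_rams(num_classes: int, width: int, height: int) -> int:
    """Rough estimate assuming model memory scales linearly with pixel count.

    Assumption: the FIFO and USB overhead does not grow with resolution.
    """
    scale = (width * height) / BASE_PIXELS
    return math.ceil(hfirst_block_rams(num_classes) * scale) + OVERHEAD_BLOCK_RAMS


if __name__ == "__main__":
    print(hfirst_block_rams(36))                      # 167
    print(estimated_total_block_rams(36, 128, 128))   # 177
    # Under these assumptions a 304x240 sensor would exceed the 268 available.
    print(estimated_total_block_rams(36, 304, 240) > AVAILABLE_BLOCK_RAMS)
```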

Digital Signal Processing (DSP) blocks are blocks within
the FPGA containing dedicated hardware for performing
multiplication and addition. The number of multiplications which
can be performed per second is a limiting factor in many
algorithms, particularly for visual processing algorithms which
compute kernel responses using convolution. Optimization of
these algorithms typically involves optimizing memory access
and pipelining to maximise utilization of hardware multipliers
(see [49] for an example). High end GPUs and FPGAs contain
thousands of hardware multipliers.

In HFirst only 17 of our FPGA’s 180 DSP blocks are used 
and these 17 DSP blocks are only utilized a small percentage 
of the time due to the temporal sparsity of the AER data. For 
HFirst, internal FPGA memory is the limiting resource when 
increasing resolution. Internal memory requirements scale with 
the input sensor resolution, while the number of DSP blocks
required will scale with the maximum sustained input event rate
the model is required to handle.

The current FPGA implementation can handle a sustained
14Meps (events per second) input event rate, while bursts of
up to 100Meps (limited by FPGA clock speed of 100MHz)
can be handled for durations up to 5µs (limited by FIFO buffer
depth). Larger FIFO buffers can be used, but are unnecessary.
At 128×128 resolution, event rates for typical scenes are
around 1Meps. The latest ATIS can generate events at a peak
rate of 25Meps, and sustain a maximum rate of 15Meps at
304×240 pixel resolution. 14Meps is therefore a very high rate
for 128×128 pixel resolution. Using additional DSP blocks,
the maximum sustainable event rate can be increased by
1Meps per block used.
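
To make the burst figures concrete, the sketch below estimates how many events accumulate in the input FIFO during a worst-case burst. It is a simple fluid model under stated assumptions (events arrive at the burst rate and are drained at the sustained rate); the function name is hypothetical, and the actual FIFO depth is not given in the text.

```python
def fifo_backlog_events(burst_rate_meps: int, sustained_rate_meps: int,
                        burst_duration_us: int) -> int:
    """Events left waiting in the FIFO at the end of a burst.

    Rates are in Meps, i.e. events per microsecond, so multiplying a rate
    by a duration in microseconds gives an event count directly.
    """
    arrived = burst_rate_meps * burst_duration_us
    processed = sustained_rate_meps * burst_duration_us
    return max(0, arrived - processed)


if __name__ == "__main__":
    # A 100Meps burst lasting 5 microseconds, drained at 14Meps.
    print(fifo_backlog_events(100, 14, 5))   # 430
```

Under this model a buffer of a few hundred events is enough to absorb the quoted burst, which is consistent with the statement that larger FIFOs are unnecessary.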


D. Power Consumption 

The FPGA board on which HFirst was implemented also
performs other tasks in parallel as part of normal operation of
the ATIS sensor (powering and controlling the ATIS, as well as
interfacing to a host PC). Implementing HFirst in addition to
the other tasks on the FPGA increases power consumption by
150mW for static scenes (little to no processing happening),
and by a further 100mW for the highest activity scene we
could generate. We therefore estimate HFirst power consumption
to be between 150mW and 250mW depending on scene
activity. These measurements are taken at the board’s power
supply and include losses due to inefficiencies in the onboard
switching regulators.

VI. Testing 

HFirst was tested on two tasks. The first consists of
recognizing pips on poker cards as they are shuffled in front of the
sensor. The poker card task has been previously tackled [35]
and was chosen to provide a direct comparison with previously
published works. The second task is a simulated reading task
in which characters are recognized as they move across the
field of view using the test setup shown in Fig. 6. Examples
of recordings used for each task are shown in Fig. 7. For both
tasks, HFirst was implemented in Matlab simulation, coupled
with a reconfigurable C++ function for increased speed.

A. Poker cards 

For the poker card task, data was provided by
Linares-Barranco [35] who captured the data using the sensitive DVS
sensor [22]. The dataset consists of 10 examples for each of the
4 card types (spades, hearts, diamonds, and clubs). For each of
10 different trials, non-overlapping test and training sets were
chosen such that each contained 5 examples of each pip. For
each pip in the training set, all 5 examples were concatenated
into a single sequence from which the S2 layer kernel was
generated. To provide a close comparison with the previously
published task, we also tested on the stabilised and extracted
pips.

Additional tests were performed in which lateral reset 
connections were removed from the model to investigate the 
value of the timing approach to computing the max. Finally,
the advantage of having orientation extraction and pooling in
S1 and C1 was investigated by bypassing these stages.

B. Character Recognition 

36 characters (0-9 and A-Z) were printed on the surface
of a barrel which was rotated at 40rpm while viewed by the
DVS ED as shown in Fig. 6. Data was recorded over two
full rotations of the barrel, thereby providing two recordings
for each character. For each of 10 trials, non-overlapping
test and training sets were randomly chosen such that every
character appears once in each set. Training and testing were
then performed using an automated script.

Training of the second layer of HFirst is performed on a
stabilised view of a moving object, and therefore requires
knowledge of the object location, which is acquired through
tracking. However, for testing we use moving sequences
instead of stabilised views, removing the need for tracking.

Fig. 6. The test setup used to acquire the character dataset, consisting of a
motorised rotating barrel covered with printed letters viewed by a DVS ED

Fig. 7. Examples of the stabilised characters and card pip views used for
training. Each example measures 32×32 pixels and shows 1.7ms of data.

As with the card task, the character recognition task was
also used to investigate the advantages of using reset
connections for max computation, and of performing orientation
extraction and pooling in S1 and C1 respectively.

Further testing was performed on the characters to show
that HFirst can detect multiple objects simultaneously present
in the scene, and to investigate the impact of timing jitter
introduced during training and testing.

Finally the importance of precise timing was investigated by
artificially altering spike times in the recordings and observing
the effect on HFirst accuracy.

VII. Results

Results from testing are summarised in Table III and
discussed in the sections below. The S1 and C1 columns show
the total number of activated synapses in each of these layers.
For S2, the two columns show the number of activated
feedforward (from C1) and lateral reset synapses respectively.

TABLE III
Detection Accuracy and Required Computation

                                   Input Synapse Activations
Task                Accuracy %     S1      C1      S2      S2 (lateral reset)

HFirst Cards
Full model          97.5 ±3.5      2.6M    10k     19k     710
No S1, C1 reset     51.6 ±4.4      2.6M    3.8k    127k    79k
No S2, C2 reset     72.3 ±3.8      2.6M    10k     19k     -
No reset            24.9 ±0.1      2.6M    3.8k    127k    -
Bypass S1           49.1 ±5.4      -       4.3k    37k     20k
Bypass S1 and C1    60.7 ±3.5      -       -       1.1M    333k

CNN Cards
Spiking [35]        91.6           -       -       -       -
Frame based [35]    95.2           -       -       -       -

HFirst Characters
Full model          84.9 ±1.9      8.4M    40k     720k    159k
No S1, C1 reset     70.4 ±5.8      8.4M    8.3k    2.4M    309k
No S2, C2 reset     56.7 ±0.9      8.4M    40k     720k    -
No reset            4.6 ±0.1       8.4M    8.3k    2.4M    -
Bypass S1           31.2 ±4.1      -       14k     1.6M    1.1M
Bypass S1 and C1    81.4 ±3.8      -       -       33M     32M

A. Cards

HFirst classified the stabilised and extracted card pips with
an accuracy of 97.5%±3.5% using an S2 threshold of 150mV.
Chance for this task is 25%. The average test example lasted
23ms and consisted of 4.3k input spikes, which elicited 73 C1
and 2.8 S2 spikes. The S1/C1 and S2/C2 layers took on average
102ms and 0.7ms respectively per example to simulate in
Matlab using a single thread on an Intel Xeon X5675 processor
running at 3.07GHz. The FPGA implementation simulates the
network in real-time, with latency < 2µs in response to
incoming events.

Removing lateral reset in the first layer decreases
recognition accuracy to 51.6%±4.4%, while removing lateral reset
connections in the second layer decreases recognition accuracy
to 72.3%±3.8%, and removing lateral reset connections in
both layers reduces recognition accuracy to chance levels,
while increasing the average number of spikes elicited to 309
and 66 for C1 and S2 respectively. These results suggest that
using the first spike mechanism improves performance, both in
terms of computational efficiency, and in terms of recognition
accuracy.
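
The lateral reset mechanism whose removal is evaluated above can be sketched as follows. This is a minimal illustration rather than the HFirst neuron model: it assumes simple integrate-and-fire units with no leak and no refractory period, and the function and parameter names are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float]  # (time, neuron index, synaptic weight)


def run_pool(events: Iterable[Event], num_neurons: int, threshold: float,
             lateral_reset: bool = True) -> List[Tuple[int, int]]:
    """Return the (time, neuron) output spikes of one pooling area.

    Assumptions for brevity: no leak and no refractory period.
    """
    potentials = [0.0] * num_neurons
    output = []
    for time, neuron, weight in events:   # events must be in time order
        potentials[neuron] += weight
        if potentials[neuron] >= threshold:
            output.append((time, neuron))
            if lateral_reset:
                # The first spike wins: reset every neuron in the pool.
                potentials = [0.0] * num_neurons
            else:
                # Only the neuron that spiked is reset.
                potentials[neuron] = 0.0
    return output


if __name__ == "__main__":
    events = [(0, 0, 1.0), (1, 1, 2.0), (2, 1, 2.0), (3, 0, 1.0), (4, 0, 1.0)]
    print(run_pool(events, 2, 3.0))                       # [(2, 1)]
    print(run_pool(events, 2, 3.0, lateral_reset=False))  # [(2, 1), (4, 0)]
```

With lateral reset only the most strongly driven neuron responds. Without it the weakly driven neuron eventually fires as well, which produces more output spikes and therefore more downstream computation.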

For the card classification task, which only has four output
classes, bypassing the first layers reduces the required
computation at the cost of recognition accuracy.


B. Characters 

HFirst classified the moving letters with an accuracy of
84.9%±1.9% using an S2 threshold of 200mV. Chance for
this task is 2.8%. The average test example lasted 112ms and
consisted of 14k input spikes, which elicited 313 C1 and 27 S2
spikes on average. The S1/C1 and S2/C2 layers
took on average 365ms and 28ms respectively per example to
simulate in Matlab using a single thread on an Intel Xeon
X5675 processor running at 3.07GHz. As with the card pip
task, the FPGA implementation easily runs in real-time with
latency < 2µs.

Next we investigated the effects of bypassing the first layers
of HFirst and performing template matching directly on the
input events. This modification resulted in an accuracy of
81.4%±3.8%, which is not too different from the performance
of the full model. However, bypassing the S1 and C1 layers
also increases the required computation significantly,
suggesting that performing orientation extraction and pooling in S1
and C1 is actually more computationally efficient. The same is
not true for the cards task where only 4 classes are present, but
is true whenever 10 or more output classes are required. This
increased computational requirement is also obvious when
observing the time taken for simulation, which increased
50-fold to an average of 19.7 seconds per example in Matlab.

Fig. 8. HFirst S2 layer spikes (indicated by markers) over a 150ms time
period in response to the character data. This figure shows the ability of HFirst
to detect multiple characters in the scene simultaneously. Both location of the
objects and their class are indicated by S2 spikes. The ‘X’, ‘F’, ‘Y’, and ‘G’
characters are correctly detected, but the character ‘H’ is misclassified, being
mistaken for an ‘F or ‘F’ at different times.

C. Detecting Multiple Objects Simultaneously 

After testing the model performance on individual
characters, we verified that it can detect multiple characters
simultaneously present in the scene. Fig. 8 shows 150ms worth of
S2 outputs with multiple characters simultaneously visible in
the scene. The S2 responses indicate both the object class and
location. In this example the letters ‘X’, ‘F’, ‘Y’, and ‘G’ are
all accurately detected as they pass across the scene. Later,
the letters ‘Z’ and ‘H’ enter the scene. The ‘Z’ is accurately
detected, but the ‘H’ is erroneously detected as an ‘F’ and ‘F at
different points in time. The 1 in 6 error for these characters is
in agreement with the 84.9%±1.9% accuracy reported overall.

Fig. 9 shows output detections for a single full rotation of
the barrel, comparing the times at which letters were detected 
(or missed) to the ground truth of when they were present in 
the scene. 

D. Effect of Timing Jitter 

In the front end AER sensor, the latency of pixel responses
and of the AER readout can vary, resulting in timing jitter in
the spikes feeding into S1. All of our tests are performed on
real recordings and therefore include some jitter. In order to
investigate the effect of increased timing jitter on the model,
we artificially added additional jitter to the recordings used
for training and testing. Jitter times for each spike were
randomly chosen from a Gaussian distribution and the effect
of varying the standard deviation of the distribution is shown
in Fig. 10. Changing the mean of the Gaussian distribution
adds a constant time offset to all spikes and has no effect
on accuracy. The accuracy for each standard deviation value
is again obtained as the mean of 10 random test and training
splits performed on the character database. Two tests were run.
In the first test (Fig. 10a), additional jitter was introduced in
the training data and the test data was left unaltered. In the
second test (Fig. 10b), the training data was left unaltered and
additional jitter was introduced only in the test data.

Fig. 9. Detection of characters for a single rotation of the barrel. Only every
second character is labelled on the vertical axis to reduce clutter. Red lines
indicate when each character is present in the visual field, while blue crosses
mark detections made by HFirst. Note that up to 4 characters are present in
the scene at any one time.

Training is performed on tracked and stabilized views of
the characters, thus for the purposes of training, the characters
appear static. HFirst can therefore tolerate high timing jitter
because even when a spike’s time is changed, it will still occur
in the correct location relative to the center of the character.
Accuracy drops off significantly only when the standard
deviation of the jitter exceeds 100ms, which is comparable to the
length of the recording itself (112ms).

Recognition is performed on moving views of the characters
which are crossing the field of view at roughly 1 pixel/ms.
Delaying a spike by even a few milliseconds (Fig. 10b) will
cause the spike to occur in the wrong location relative to
the center of the character (because the character center will
have moved during the delay period). Therefore, even a few
milliseconds of timing jitter will cause a significant decrease
in recognition accuracy.
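
The jitter experiment described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the script used for the paper: the event layout, the function names, and the choice to re-sort events by time after jittering are all assumptions.

```python
import random
from typing import List, Optional, Tuple

Event = Tuple[float, int, int]  # (timestamp in ms, x, y)


def add_timing_jitter(events: List[Event], sigma_ms: float,
                      mean_ms: float = 0.0,
                      seed: Optional[int] = None) -> List[Event]:
    """Add Gaussian jitter to every timestamp and return events in time order.

    Assumption: the downstream model expects a time-ordered stream, so the
    jittered events are re-sorted.
    """
    rng = random.Random(seed)
    jittered = [(t + rng.gauss(mean_ms, sigma_ms), x, y) for t, x, y in events]
    return sorted(jittered, key=lambda event: event[0])


def position_error_pixels(delay_ms: float,
                          speed_pixels_per_ms: float = 1.0) -> float:
    """Distance the character centre moves while a spike is delayed."""
    return abs(delay_ms) * speed_pixels_per_ms


if __name__ == "__main__":
    events = [(0.0, 10, 10), (1.0, 11, 10), (2.0, 12, 10)]
    print(add_timing_jitter(events, sigma_ms=5.0, seed=0))
    # At roughly 1 pixel/ms, a 3 ms delay misplaces a spike by about 3 pixels.
    print(position_error_pixels(3.0))
```

For stabilised training views the object does not move, so the position error is zero regardless of the delay, which matches the difference between the two tests.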

VIII. Discussion 

In this paper we have described a spiking neural network
for visual recognition dubbed “HFirst”. HFirst exploits timing
information in the incoming visual events to implement a
time-to-first spike operation as a temporal Winner-Take-All
(WTA) operation with lateral reset to block responses from
other neurons in the same pooling area. Computationally,
this temporal WTA is significantly simpler than the MAX
operation typically used in hierarchical models.

Fig. 10. The effect of timing noise on recognition accuracy for the character
recognition task. Adding Gaussian noise to the stabilized training data (a) has
little effect on accuracy because even when delayed, spikes occur in the correct
location relative to the character center. Accuracy drops off significantly only
when the timing jitter is large enough to cause the training data spikes to
be too spread in time. Adding even a small degree of Gaussian noise to the
moving characters used for testing (b) causes accuracy to drop off significantly
because by the time the delayed (jittered) spikes arrive at the S1 inputs, the
character has already moved on to a new location.

HFirst operates on change detection data from AER sensors.
Each pixel in these sensors adapts individually to ambient
lighting conditions, which to a large extent removes
dependence on lighting conditions. This removes the need for
normalization of oriented Gabor responses in HFirst, which is
another computationally intensive task (division) required by
the standard HMAX model and other CNN implementations.

Thus far HFirst has been tested on simple objects, and
neurons in the second layer of HFirst directly detect the
presence of these objects, allowing HFirst to simultaneously
detect multiple objects in the scene, which is not typically
possible with CNNs.

Masquelier et al. [43] used STDP to learn more complex
features, and a powerful Radial Basis Function (RBF) classifier
which allows recognition of more complex objects
(motorcycles and faces from Caltech 101). Their approach used STDP
to extract features with high correlation between training
examples, even though these features appear at different
locations. This removes the need to precisely track and stabilise
a view of an object for training. However, the model only
operates on static images, removing the problem of moving
stimuli, and objects are already centered in the Caltech 101
database (although features do not always appear at the same
location). A second major difference is that HFirst operates
continuously, whereas Masquelier et al. present images to their
model sequentially, requiring the system to be reset before
each image presentation.

In a recent PAMI paper, Perez-Carrasco et al. [35] reported
an accuracy ranging from 90.1% to 91.6% for the card pip
task using a five layer spiking CNN. They kindly provided
us with their data and for the same task we report accuracy
of 97.5%±3.5%. However, we compute accuracy differently
to Perez-Carrasco et al. Their CNN implementation includes
separate “positive” and “negative” responses to represent the
presence or absence for each object, and both these responses
are used in their calculation of accuracy. HFirst has no
“negative” responses, which prevents us from using the same
equation. Instead, HFirst provides only positive responses, and
does not respond when no objects of interest are present in
the scene. Nevertheless, if we consider a lack of response
from a neuron to be a “negative” response, then we can
use the same equation. Doing so marginally increases our
accuracy to 98.8%±1.9% because correct “negative” responses
are rewarded, even when “positive” responses are incorrect.
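
The following sketch illustrates why counting “negative” responses raises the reported figure. It is not the equation used by Perez-Carrasco et al., which is not reproduced here; the function names and the scoring rule (each class output scored as an independent yes/no decision) are assumptions made purely for illustration.

```python
def positive_only_accuracy(num_correct: int, num_wrong: int) -> float:
    """Fraction of examples assigned the right class."""
    return num_correct / (num_correct + num_wrong)


def per_response_accuracy(num_correct: int, num_wrong: int,
                          num_classes: int) -> float:
    """Accuracy when every class output is scored as a yes/no decision.

    Assumption: a misclassified example produces exactly one false
    "positive" and one missed "positive", so the remaining
    num_classes - 2 silent outputs still count as correct "negatives".
    """
    total = (num_correct + num_wrong) * num_classes
    right = num_correct * num_classes + num_wrong * (num_classes - 2)
    return right / total


if __name__ == "__main__":
    # Hypothetical counts: 39 of 40 four-class examples classified correctly.
    print(positive_only_accuracy(39, 1))       # 0.975
    print(per_response_accuracy(39, 1, 4))     # 0.9875
```

Under this assumed scoring rule each error costs only half as much for a four class task, so the accuracy moves in the same direction as the increase reported above.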

The card pip task was also used to investigate the benefits of 
including lateral reset, by showing that removal of lateral reset 
connections in the first, second, or both layers consistently 
reduces recognition accuracy, while simultaneously increasing 
computational requirements. 

Given the high accuracy of the full HFirst model on the
card pip recognition task, a second more difficult character
recognition task was constructed and was also used to
investigate the benefits of a multi-layer model. Bypassing the first
layer decreased accuracy from 84.9%±1.9% to 81.4%±3.8%,
suggesting that the first layer increases recognition accuracy.
Perhaps more importantly, the first layer significantly reduces
computational requirements for the character recognition task.
The same was not true for the card recognition task because it
consists of very few classes (4), but as the number of classes
increases, so does the number of neurons in S2, therefore
making it more important to have the S1 and C1 layers to
reduce the number of spikes reaching S2.

The leaky integrate and fire neurons used in HFirst
essentially perform coincidence detection on input spikes arriving
in a specific spatial pattern. A neuron will only generate an
output spike if enough input spikes matching this pattern
are received within a sufficiently short time period. Under
ideal circumstances (no noise), the projection of an object
moving between two points on the focal plane will generate
the same number of spikes from the AER sensor, regardless
of the speed of the object. However, the speed of the object
will determine the time period over which these spikes are
generated, and slow moving objects may not generate spikes at
a high enough rate to elicit a response from HFirst layer 1
neurons. This can be overcome through active sensing, by
using a small motion or vibration of the sensor to elicit an
egomotion induced velocity on the image plane.
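
The dependence on object speed can be illustrated with a toy leaky integrate and fire neuron. The sketch below is not the HFirst neuron model: the linear decay towards zero, the parameter values, and the function name are assumptions chosen to keep the arithmetic simple.

```python
from typing import Iterable, Optional


def first_output_spike(spike_times_ms: Iterable[int], weight: int,
                       threshold: int, leak_per_ms: int) -> Optional[int]:
    """Time of the first output spike, or None if the neuron never fires.

    Assumption: the membrane potential decays linearly towards zero
    between input spikes.
    """
    potential = 0
    last_time = None
    for time in spike_times_ms:
        if last_time is not None:
            potential = max(0, potential - leak_per_ms * (time - last_time))
        potential += weight
        last_time = time
        if potential >= threshold:
            return time
    return None


if __name__ == "__main__":
    fast = range(0, 10, 1)   # 10 input spikes, 1 ms apart (fast object)
    slow = range(0, 50, 5)   # the same 10 spikes, 5 ms apart (slow object)
    print(first_output_spike(fast, weight=20, threshold=100, leak_per_ms=10))  # 8
    print(first_output_spike(slow, weight=20, threshold=100, leak_per_ms=10))  # None
```

Both inputs contain the same number of spikes, but only the closely spaced ones outpace the leak and reach threshold.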

IX. Conclusion 

We have presented an HMAX inspired hierarchical SNN 
architecture for visual object recognition dubbed ‘HFirst’. 
The architecture uses an SNN to exploit the precise spike 
timing provided by asynchronous change detection vision 
sensors to simplify implementation of a non-linear pooling 
operation commonly used in bio-inspired recognition models. 








HFirst obtains the best reported accuracy on a card pip 
recognition test and results for a second, far more difficult 
character recognition task have also been presented. The low 
computational requirements of the HFirst model allow for real 
time implementation on an Opal Kelly XEM6010 FPGA board 
which interfaces directly with the vision sensor, and is both 
narrower and shorter than a credit card in size. 